DOCUMENT RESUME

ED 344 937                                        TM 018 299

AUTHOR        Ackerman, Terry A.; Evans, John A.
TITLE         An Investigation of the Relationship between Reliability, Power,
              and the Type I Error Rate of the Mantel-Haenszel and Simultaneous
              Item Bias Detection Procedures.
PUB DATE      Apr 92
NOTE          32p.; Paper presented at the Annual Meeting of the National
              Council on Measurement in Education (San Francisco, CA,
              April 21-23, 1992).
PUB TYPE      Reports - Evaluative/Feasibility (142) --
              Speeches/Conference Papers (150)
EDRS PRICE    MF01/PC02 Plus Postage.
DESCRIPTORS   Comparative Analysis; Equations (Mathematics); *Error of
              Measurement; *Item Bias; *Mathematical Models; Monte Carlo
              Methods; Raw Scores; *Sample Size; Test Items; *Test Reliability
IDENTIFIERS   Ability Estimates; *Mantel Haenszel Procedure; Power
              (Statistics); *Simultaneous Item Bias Procedure; Type I Errors



ABSTRACT 

The relationship between levels of reliability and
the power of two bias and differential item functioning (DIF)
detection methods is examined. Both methods, the Mantel-Haenszel (MH)
procedure of P. W. Holland and D. T. Thayer (1988) and the
Simultaneous Item Bias (SIB) procedure of R. Shealy and W. Stout
(1991), use examinees' raw scores as a conditioning variable in the
computation of differential performance between two groups of
interest. As a result, the extent to which examinees' observed scores
accurately reflect their true abilities plays an important role. If
examinees are misrepresented by their observed scores (as for a test
with low reliability) then the ability of bias detection methods to
determine item bias may not be very accurate. Results of Monte Carlo
studies (40-item test, 720 testing conditions) suggest that for a
fixed-length test, the power of both statistics increases moderately
as reliability is increased and substantially as sample size is
increased. However, the combination of small sample sizes and high
reliability results in a decrease of power. For most of the simulated
conditions, the MH and SIB procedures have very similar rates of
correctly rejecting the biased item. Sixteen plots illustrate the
discussion. There is a 15-item list of references. (SLD)






An Investigation of the Relationship Between Reliability,
Power and the Type I Error Rate of the Mantel-Haenszel
and Simultaneous Item Bias Detection Procedures



Terry A. Ackerman 

John A. Evans
University of Illinois 






Paper presented at the 1992 NCME Annual Meeting, San Francisco, CA, April 21, 1992.






Abstract 



This study examines the relationship between levels of reliability and the power of two
bias and differential item functioning (DIF) detection methods. Both methods, the
Mantel-Haenszel (MH) procedure (Holland & Thayer, 1988) and the Simultaneous Item Bias (SIB)
procedure (Shealy & Stout, 1991), use examinees' raw scores as a conditioning variable in the
computation of differential performance between two groups of interest. As a result, the extent
to which examinees' observed scores accurately reflect their true abilities plays an important
role. If examinees are misrepresented by their observed scores (as for a test with low
reliability) then the ability of bias detection methods to determine item bias may not be very
accurate. Results suggest that for a fixed-length test, the power of both statistics increases
moderately as reliability is increased and substantially as sample size is increased. However,
the combination of small sample sizes and high reliability resulted in a decrease of power. For
most of the simulated conditions the MH procedure and SIB had very similar rates of correctly
rejecting the biased item.



An Investigation of the Relationship Between Reliability, Power
and the Type I Error Rate of the Mantel-Haenszel and Simultaneous Item
Bias Detection Procedures

Objectives of the Study 

The purpose of this study was to provide the testing practitioner with information
concerning the interaction between test reliability and the accuracy of two item bias and
differential item functioning (DIF) detection procedures.* The power and Type I error rate of
two bias detection methods, the Mantel-Haenszel (MH) procedure (Holland & Thayer, 1988)
and the Simultaneous Item Bias (SIB) detection procedure (Shealy & Stout, 1991), were
examined. The MH procedure has developed into a nonparametric benchmark test that is widely
used by testing practitioners. SIB, a relatively new procedure, is also nonparametric.
Unlike the MH procedure, SIB can evaluate the collective bias of more than one item.

Both procedures use raw score as a conditioning variable to form groups of examinees of
comparable ability. Consequently, it is imperative that the test be equally reliable for both
groups and that each examinee's observed score be an accurate indication of their true ability.
The concern that prompted this study was that if the reliability of a test were to decrease, the
power of MH and/or SIB to detect bias might suffer. The shape of the observed score distribution
is a function of the test reliability (cf. Lord, 1953). If a test has low reliability (e.g.,
$\rho_{xx'} = .70$), the resulting observed score distribution tends to be leptokurtic with a
relatively small variance. As the items become more discriminating and reliability increases,
the observed score distribution changes to a more platykurtic, uniform shape. At very high
levels of reliability ($\rho_{xx'} = .95$), the distribution becomes U-shaped with examinees
being grouped in the tails. Hence, the level of reliability, because of its effect on how
examinees are spread out along the raw score scale, could affect the power of the two procedures
to accurately detect biased items.

A second area of concern when using these procedures is the number of examinees in the
two groups of interest. Often practitioners do not have the luxury of having a large sample of
minority subjects, who are usually the focal group of interest. It is quite common for these
groups to be greatly under-represented in the total examinee population. Thus, the effects
of different ratios of focal versus reference group sizes were studied.

A third issue that was addressed in this study was the amount of bias. Because this study
focuses on the power of each test as reliability and sample sizes are varied, the
effect size, or amount of bias, had to be varied. In this study the effect size was defined as
the angular difference between the measurement direction (i.e., direction of maximum
information) of the item and the measurement direction of the test. This concept will be
discussed in detail later.

The final factor that was varied in this study was the number of biased items. In cases
of no bias the Type I error rates of both procedures were examined. When one item was
simulated as being biased, the power of each procedure to make a correct rejection was
investigated. For multiple biased items the study focused only on the power of SIB, considering
its performance as the reliability and the sample sizes of reference and focal groups were varied
in selected combinations.

Theoretical Background 

Multidimensional IRT

For simplicity, in this paper bias will be examined from a two-dimensional perspective
in which one dimension represents the pure, intended-to-be-measured ability, denoted by $\theta$,
and the other dimension represents the nuisance abilities, denoted by $\eta$. The ability $\eta$
represents a skill that is not intended to be measured, but may be used by examinees to solve an
item with a potential for bias. The work of Reckase (1986), which formally defines
multidimensional item response theory (MIRT) item characteristics, provides an excellent
foundation from which to examine the interaction between multidimensional items and the
underlying multidimensional ability distributions for groups of interest.

Reckase's work is based upon the MIRT compensatory model (M2PL), which for the
purposes of this paper will be expressed in terms of the true ability dimension, $\theta$, and
the nuisance dimension, $\eta$. The probability of a correct response to item $i$ by examinee
$j$ can be written as

$$P(X_{ij} = 1 \mid \mathbf{a}_i, d_i, \theta_j, \eta_j) = \frac{\exp(a_{1i}\theta_j + a_{2i}\eta_j + d_i)}{1 + \exp(a_{1i}\theta_j + a_{2i}\eta_j + d_i)} \tag{1}$$

where $X_{ij}$ is the score (0,1) on item $i$ by person $j$, $\mathbf{a}_i = (a_{1i}, a_{2i})$ is
the vector of item discrimination parameters, $d_i$ is a scalar difficulty parameter of item $i$,
and $(\theta_j, \eta_j)$ is the vector of ability parameters for person $j$.
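
As a minimal illustration of the compensatory property of the model, the following Python sketch evaluates the response function for two examinees. It assumes the usual logistic form of the M2PL with no 1.7 scaling constant; the function name `m2pl_prob` and the item parameters are hypothetical and are not taken from the study.

```python
import math

def m2pl_prob(a1, a2, d, theta, eta):
    """Probability of a correct response under the compensatory M2PL model.

    Assumes the plain logistic form (no 1.7 scaling constant).
    """
    z = a1 * theta + a2 * eta + d
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical item that measures both dimensions equally (a1 = a2 = 1, d = 0).
# A deficit of one unit on theta is fully offset by a surplus of one unit on eta.
p_balanced = m2pl_prob(1.0, 1.0, 0.0, theta=0.0, eta=0.0)
p_offset = m2pl_prob(1.0, 1.0, 0.0, theta=-1.0, eta=1.0)
print(p_balanced, p_offset)

# A valid item (a2 = 0) ignores the nuisance ability entirely.
p_low_eta = m2pl_prob(1.5, 0.0, 0.0, theta=1.0, eta=-5.0)
p_high_eta = m2pl_prob(1.5, 0.0, 0.0, theta=1.0, eta=5.0)
print(p_low_eta == p_high_eta)
```

The second comparison shows why items with a zero discrimination on the nuisance dimension cannot favor either group at a fixed level of the target ability.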

In a two-dimensional latent ability space (e.g., math and verbal ability dimensions), the
$a_{1i}$ and $a_{2i}$ parameters designate the composite of $\theta$ and $\eta$ that item $i$ is
measuring. If $a_{1i} = a_{2i}$, both dimensions would be measured equally well. However, if
$a_{1i} = 0$ and $a_{2i} = 1.0$, discrimination would occur only along the $\eta$ dimension. If
all of the items in a test are measuring exactly the same $(\theta, \eta)$ composite (i.e., the
same "direction" in the $(\theta, \eta)$ coordinate system), the test would be strictly
unidimensional. The more varied the composites that are being assessed, the more
multidimensional the test.

Reckase's (1986) work describes how to graphically represent an item that requires the
application of multiple abilities as a vector in a multidimensional latent space. The length of
the vector for item $i$ is equal to the degree of multidimensional discrimination, MDISC. This
can be computed using the formula

$$\mathrm{MDISC}_i = \sqrt{a_{1i}^2 + a_{2i}^2} \tag{2}$$

MDISC is analogous to the unidimensional IRT model's discrimination parameter. The
measurement direction of the vector in degrees from the positive $\theta$ axis is

$$\alpha_i = \arccos\left(\frac{a_{1i}}{\mathrm{MDISC}_i}\right) \tag{3}$$

This reference angle represents the composite of the $\theta$-$\eta$ ability space that item $i$
is best measuring.

The item vector originates at, and is graphed orthogonal to, the p = .5 equiprobability
contour. In the compensatory model described in (1) these equiprobability contours are always
parallel.

For item $i$, the distance, $D_i$, from the origin to the p = .5 contour is computed as

$$D_i = \frac{-d_i}{\mathrm{MDISC}_i}$$

$D_i$ is analogous to the unidimensional IRT difficulty parameter. Because the discrimination
parameters can never be negative, the item vectors can lie only in the third quadrant
(representing easy items) or in the first quadrant (representing more difficult items). Figure 1
illustrates the item response surface for an M2PL item whose parameters are $a_1 = 1.8$,
$a_2 = .3$, and $d = .5$. Also illustrated in the bottom portion of Figure 1 is the item's vector,
superimposed upon the equiprobability contours of the response surface.
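
The three vector quantities described above can be computed directly from the item parameters. The Python sketch below uses Reckase's standard definitions (vector length, angle from the positive $\theta$ axis, and signed distance to the p = .5 contour) and applies them to the Figure 1 item, taking $a_1 = 1.8$ as the value as best it can be read from the source. The helper name `item_vector` is hypothetical.

```python
import math

def item_vector(a1, a2, d):
    """Return (MDISC, angle in degrees from the positive theta axis, distance D)."""
    mdisc = math.hypot(a1, a2)                 # length of the item vector
    angle_deg = math.degrees(math.acos(a1 / mdisc))
    distance = -d / mdisc                      # signed distance to the p = .5 contour
    return mdisc, angle_deg, distance

# Figure 1 item; a1 = 1.8 is an assumed reading of a partly illegible value.
mdisc, angle_deg, distance = item_vector(1.8, 0.3, 0.5)
print(round(mdisc, 3), round(angle_deg, 2), round(distance, 3))
```

Because $d$ is positive for this item, the distance is negative, which places the vector in the third quadrant and marks the item as relatively easy.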



Insert Figure 1 about here 

Definition of bias 

Bias, according to Shealy & Stout (1989, 1991) and Kok (1988), should be conceptualized
by examining the difference in certain marginal item characteristic curves (ICCs) for the two
groups of interest. The marginal ICC for a particular group is computed by

$$P(X_i = 1 \mid \Theta = \theta) = \int P_i(\theta, \eta)\, f(\eta \mid \theta)\, d\eta$$

where $P_i(\theta, \eta)$ is the M2PL response function defined in (1) and $f(\eta \mid \theta)$
is the specified group's conditional distribution of the nuisance dimension $\eta$, given a fixed
value of $\theta$, the target ability. For a fixed $\theta$, $P_i(\theta, \eta)$ varies with
$\eta$, and $P(X_i = 1 \mid \Theta = \theta)$ is obtained by averaging $P_i(\theta, \eta)$ over
$\eta$. Specifically, $P(X_i = 1 \mid \Theta = \theta)$ is the unidimensional ICC that will be
obtained if differences in the nuisance direction are integrated out. It approximates the ICC
that would be obtained via calibration using a unidimensionality-based computer program such as
BILOG (Mislevy & Bock, 1983) if the test were strictly unidimensional (i.e., if there were
no nuisance dimension $\eta$). It is important to note that if $f(\eta \mid \theta)$ is the same
for both groups, bias cannot occur because examinees of equal $\theta$ ability will have the same
probability of getting the item right.
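
The averaging over the nuisance dimension can be sketched numerically. The Python example below integrates an M2PL response surface over the conditional distribution of $\eta$ given $\theta$, assuming bivariate normal abilities and the plain logistic form of the model. The group values mirror those described in the Method section, read here as variances (an assumption); all function and variable names are hypothetical.

```python
import math

def m2pl_prob(a1, a2, d, theta, eta):
    """Compensatory M2PL response function (logistic form, no 1.7 constant assumed)."""
    return 1.0 / (1.0 + math.exp(-(a1 * theta + a2 * eta + d)))

def marginal_icc(theta, item, group, n_points=2001):
    """Average the M2PL surface over eta given theta for one group.

    item  = (a1, a2, d)
    group = (mu_theta, mu_eta, var_theta, var_eta, rho), bivariate normal.
    Uses a trapezoid rule over +/- 8 conditional standard deviations.
    """
    a1, a2, d = item
    mu_t, mu_e, var_t, var_e, rho = group
    cond_mean = mu_e + rho * math.sqrt(var_e / var_t) * (theta - mu_t)
    cond_sd = math.sqrt(var_e * (1.0 - rho ** 2))
    lower = cond_mean - 8.0 * cond_sd
    step = 16.0 * cond_sd / (n_points - 1)
    total = 0.0
    for k in range(n_points):
        eta = lower + k * step
        z = (eta - cond_mean) / cond_sd
        density = math.exp(-0.5 * z * z) / (cond_sd * math.sqrt(2.0 * math.pi))
        weight = 0.5 if k in (0, n_points - 1) else 1.0
        total += weight * m2pl_prob(a1, a2, d, theta, eta) * density * step
    return total

reference = (0.0, 0.75, 1.5, 0.75, 0.4)   # higher mean on the nuisance ability
focal = (0.0, 0.0, 1.5, 0.75, 0.4)
angle = math.radians(30.0)
biased_item = (1.5 * math.cos(angle), 1.5 * math.sin(angle), 0.0)
valid_item = (1.5, 0.0, 0.0)

for label, item in (("biased", biased_item), ("valid", valid_item)):
    gap = marginal_icc(0.0, item, reference) - marginal_icc(0.0, item, focal)
    print(label, round(gap, 4))
```

For the valid item the two marginal ICCs coincide, while for the item with a nonzero angle the reference group's curve lies above the focal group's at the same $\theta$.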




Bias detection methods 

Although there has been a proliferation of methods to detect item bias, this paper will
focus on only two: the Mantel-Haenszel (MH) strategy (Holland & Thayer, 1988) and Shealy and
Stout's SIB (1991). Both of these procedures are nonparametric and thus require no model
calibration. They both have IRT justifications and yet, because they do not require IRT
calibration, they are computationally non-intensive. To facilitate understanding they will be
discussed within the IRT context developed above.

The MH procedure, when placed in a unidimensional IRT framework, examines item bias
using the one-parameter Rasch model. In this model all items are assumed to be equal in
discrimination and to vary only in difficulty, tenuous assumptions at best. As such, the MH
procedure is designed to be primarily sensitive to uniform bias. An item displays uniform bias
if the ICCs for two groups of interest differ by only a horizontal translation (i.e., they are
"parallel" but not coincident). It is important to note that if the response process is modeled
using the 2PL or 3PL IRT models, the ICCs may be non-parallel, causing non-uniform bias.
Then the IRT MH theory may not apply (see Zwick (1990) for a more complete discussion).
By including the suspect item in the matching criterion, it can be shown under the Rasch
framework (Holland & Thayer, 1988) that when all of the items except the suspect item exhibit
no bias, the procedure partials out the effect due to impact.

The MH-CHISQ statistic has an approximate chi-square distribution with one degree of freedom.
However, one cannot tell from this statistic whether a significant value means the item favors
the reference or the focal group. To determine the direction of bias one could use the MH
estimator $\alpha_{MH}$, which represents the average factor by which the odds of a reference
group member successfully responding to the studied item exceed the corresponding odds for a
matched member of the focal group. A value greater than one implies that the reference group
outperformed the focal group. Because the odds-ratio scale is not symmetric (i.e., it has a scale
of 0 to $\infty$ with $\alpha_{MH} = 1$ representing no bias) it is convenient to take the log of
$\alpha_{MH}$. This new statistic, $\Delta_{MH}$, indicates the amount of bias.
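
The three MH quantities discussed here can be illustrated with a short Python sketch. It uses the formulas as they are commonly stated following Holland and Thayer (1988): the common odds-ratio estimator, the continuity-corrected chi-square, and the delta-scale transformation $\Delta_{MH} = -2.35 \ln \alpha_{MH}$. The function name, the table layout, and the example counts are hypothetical and are not data from the study.

```python
import math

def mantel_haenszel(tables):
    """Compute (alpha_MH, delta_MH, MH-CHISQ) from 2x2 tables, one per score level.

    Each table is (A, B, C, D) = (reference right, reference wrong,
    focal right, focal wrong) for examinees matched on raw score.
    """
    num = den = sum_a = sum_expected = sum_var = 0.0
    for a, b, c, d in tables:
        total = a + b + c + d
        if total < 2:
            continue  # a score level this sparse carries no information
        n_ref, n_foc = a + b, c + d
        m_right, m_wrong = a + c, b + d
        num += a * d / total
        den += b * c / total
        sum_a += a
        sum_expected += n_ref * m_right / total
        sum_var += n_ref * n_foc * m_right * m_wrong / (total * total * (total - 1))
    alpha = num / den
    delta = -2.35 * math.log(alpha)
    chisq = (abs(sum_a - sum_expected) - 0.5) ** 2 / sum_var
    return alpha, delta, chisq

# Hypothetical counts at two score levels.
unbiased = [(30, 20, 15, 10), (40, 10, 20, 5)]   # equal odds in both groups
biased = [(35, 15, 10, 15), (45, 5, 15, 10)]     # reference group favored

alpha_u, delta_u, chisq_u = mantel_haenszel(unbiased)
alpha_b, delta_b, chisq_b = mantel_haenszel(biased)
print(alpha_u, delta_u, chisq_u)
print(alpha_b, delta_b, chisq_b)
```

In the second example the odds ratio exceeds one, the delta value is negative (bias against the focal group), and the chi-square exceeds 3.84, the critical value for one degree of freedom at the .05 level. The sketch does not guard against a zero denominator, which can occur with very sparse tables.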




In the case of the Rasch model, $\Delta_{MH}$ can be expressed as

$$\Delta_{MH} = -2.35\,(b_f - b_r) \tag{8}$$

where $b_f$ and $b_r$ are the difficulty parameters for the marginal ICCs of the studied item
given in (1) above for the focal and reference groups, respectively. The $\Delta_{MH}$ index
represents the difference in the mean horizontal distance between the marginal ICCs. (Note that
when $\Delta_{MH} < 0$, the studied item is biased against the focal group.) The horizontal
distance between ICCs is used to assess the amount of bias being represented by the differences
in the odds ratio at each score level for the two groups of interest. Some studies (Shealy, 1989;
Shealy & Stout, 1991) have reported that the MH chi-square procedure is reasonably robust against
inflated Type I error when impact is present for many IRT models, as well as robust against loss
of power when nonuniform bias is present (even if the generating model is a 2PL or a 3PL IRT
model). Impact is defined as the proportion-correct differences that occur only on the valid
skill. Zwick's (1990) work shows that if the correct model is 2PL or 3PL, the Type I error for MH
can be seriously inflated. Recent work by Roussos (1992) found the MH Type I error rate for many
such models to be inflated to a larger degree than that of the SIB procedure.

Shealy and Stout (1991) have a similar theoretical item (and test) bias index called $\beta_U$
which, in the IRT context, is the average vertical distance between the marginal ICCs of the
studied item with respect to $\theta$ (the valid subtest ability). This index has a simple
empirical interpretation. It is the average difference in probability of a correct response
experienced by the two groups for the studied item with impact partialled out. In this sense,
$\beta_U$ is similar conceptually to the Standardization index (Dorans & Kulick, 1986).
Computationally, $\beta_U$ is expressed by

$$\beta_U = \int \left[ T_R(\theta) - T_F(\theta) \right] f_F(\theta)\, d\theta \tag{9}$$

where $T_R(\theta)$ and $T_F(\theta)$ are the marginal ICCs of the suspect item for the reference
and focal groups, respectively, given by (7), and $f_F(\theta)$ is the $\theta$-marginal density
of the focal group. Shealy and Stout also have another index, which is identical to (9) with the
exception that the absolute value of the difference between the two marginal ICCs is computed.
This index is designed for cases in which non-uniform bias would occur.




An advantage of using SIB is that the practitioner can weight differences in item
performance at each score level by the proportion of examinees in the focal group, or the
reference group, or both. This feature, which is also present in the Standardization procedure
(Dorans & Kulick, 1986), is not present in the MH procedure. In this study the SIB statistic was
always weighted by the proportion of focal group examinees represented in the particular observed
score category.
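
The focal-group weighting described above can be sketched as follows. This Python example computes only the weighted difference in proportion correct across observed score levels; it deliberately omits any further adjustment that the published SIB estimator applies, so it should be read as an illustration of the weighting scheme and not as the SIB statistic itself. The function name and the counts are hypothetical.

```python
def focal_weighted_difference(tables):
    """Weighted mean difference in proportion correct (reference minus focal).

    Each table is (reference right, reference wrong, focal right, focal wrong)
    at one observed score level. Weights are the share of all focal group
    examinees who fall at that score level.
    """
    n_focal_total = sum(c + d for _, _, c, d in tables)
    beta_hat = 0.0
    for a, b, c, d in tables:
        n_ref, n_foc = a + b, c + d
        if n_ref == 0 or n_foc == 0:
            continue  # no comparison is possible at this score level
        weight = n_foc / n_focal_total
        beta_hat += weight * (a / n_ref - c / n_foc)
    return beta_hat

# Hypothetical counts at two score levels; the reference group is favored by .30
# at each level, so the weighted difference is also .30.
tables = [(35, 15, 10, 15), (45, 5, 15, 10)]
beta_hat = focal_weighted_difference(tables)
print(beta_hat)
```

Weighting by the focal group places the most emphasis on the score levels where focal group examinees are concentrated, which is the choice made in this study.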

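One way to express the potential for bias from the Shealy-Stout perspective is by
examining the difference between the expected values of the reference and focal group
$\eta \mid \theta$ conditional distributions. If this difference is not equal to zero at every
$\theta$, then there is a potential for bias. By examining the expectation of this difference it
becomes quite clear which differences in the underlying ability distributions will produce bias.
That is, when $(\theta, \eta)$ is bivariate normal in each group, the difference between the
expected values of the conditional distributions for a given value of $\theta$ can be expressed as

$$E_R[\eta \mid \theta] - E_F[\eta \mid \theta] = (\mu_{\eta R} - \mu_{\eta F}) + \rho_R \frac{\sigma_{\eta R}}{\sigma_{\theta R}}(\theta - \mu_{\theta R}) - \rho_F \frac{\sigma_{\eta F}}{\sigma_{\theta F}}(\theta - \mu_{\theta F}) \tag{10}$$

where $\rho_R$ and $\rho_F$ are the within-group correlations between $\theta$ and $\eta$.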

It should be noted that the potential for bias as given by (10) only reflects the size of the
difference between the conditional expected values of $\eta \mid \theta$. If the conditional
expected values are equal at every $\theta$ for the two groups of interest, the potential for
bias may still exist because higher order conditional moments need not be the same for the two
groups (although this is probably unlikely in actual applications). That is, the potential for
bias exists whenever the underlying ability distributions for the studied groups are not exactly
the same (more accurately, (10) should be replaced by a comparison of the conditional
distributions $f_R(\eta \mid \theta)$ and $f_F(\eta \mid \theta)$). Using the assumption of
bivariate normality, one need only specify the first two moments to identify the underlying
ability distributions exactly. Thus, $f_R(\eta \mid \theta) = f_F(\eta \mid \theta)$ if and only
if $\mu_{\eta_R \mid \theta} = \mu_{\eta_F \mid \theta}$ and
$\sigma^2_{\eta_R \mid \theta} = \sigma^2_{\eta_F \mid \theta}$.

The Shealy and Stout test statistics and their corresponding estimators are
relatively new and do offer the researcher several advantages over the MH procedure and its
corresponding estimators $\hat{\alpha}_{MH}$ and $\hat{\Delta}_{MH}$. They were developed from a
multidimensional modeling perspective and emphasize the examination of bias at the test level.

Method 

To study the relationship between level of test reliability as measured by KR-20 (a lower
bound estimate) and the MH and SIB statistics, a Monte Carlo format was used. There were four
main factors of interest in the design: amount of bias, number of biased items, number of
subjects in the reference and focal groups, and the level of test reliability. These are
summarized in Figure 2.



Insert Figure 2 about here 



A test with 40 items was simulated for each possible combination of the four factors. All
valid items were measuring only $\theta$, the purported skill. Specifically, all items except for
the biased item(s) had an $a_{2i}$ parameter equal to zero. Also, for each cell of this fully
crossed design, the reference group and focal group examinees were randomly generated from
ability distributions which had $\sigma^2_\theta$ and $\sigma^2_\eta$ equal to 1.5 and .75,
respectively. The centroid of the reference group was located at $\mu_\theta = 0.0$,
$\mu_\eta = .75$. The focal group was centered at $\mu_\theta = 0.0$, $\mu_\eta = 0.0$. The two
latent abilities, $\theta$ and $\eta$, were correlated r = .4 for each group. A density
contour for each group is displayed in Figure 3.



Insert Figure 3 about here 

It should be apparent from the preceding discussion that, based upon the marginal
distributions for each group, both should perform similarly on any item measuring only $\theta$,
but any item capable of measuring $\eta$ would favor the reference group.
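
A short calculation makes this point concrete. Under bivariate normality the conditional mean of $\eta$ given $\theta$ is $\mu_\eta + r(\sigma_\eta/\sigma_\theta)(\theta - \mu_\theta)$. The Python sketch below evaluates the reference-minus-focal difference in that conditional mean at several values of $\theta$, reading the reported 1.5 and .75 as variances (an assumption). The function and dictionary names are hypothetical.

```python
import math

def conditional_mean_eta(theta, mu_theta, mu_eta, var_theta, var_eta, r):
    """E[eta | theta] for a bivariate normal ability distribution."""
    return mu_eta + r * math.sqrt(var_eta / var_theta) * (theta - mu_theta)

# Group parameters as described in the text (spread values read as variances).
reference = dict(mu_theta=0.0, mu_eta=0.75, var_theta=1.5, var_eta=0.75, r=0.4)
focal = dict(mu_theta=0.0, mu_eta=0.0, var_theta=1.5, var_eta=0.75, r=0.4)

thetas = (-2.0, -1.0, 0.0, 1.0, 2.0)
differences = [
    conditional_mean_eta(t, **reference) - conditional_mean_eta(t, **focal)
    for t in thetas
]
print(differences)
```

Because the two groups share the same variances and correlation, the slope terms cancel and the difference equals the gap in nuisance-ability means at every $\theta$. Reference group examinees therefore hold the same expected advantage on $\eta$ at every level of the target ability.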




The degree of bias was measured by the extent to which the biased item exploited the
difference of the underlying $\eta$ distributions for the reference and focal groups. This amount
ranged from 0° (representing no bias) to 90° (representing the maximum potential for bias) in
ten-degree increments.

Valid, non-biased items measured only the first dimension (i.e., $a_2 = 0.0$). For the case
in which there were no biased items, one of the valid items was randomly selected to be the
suspect item. In the one-biased-item case the M2PL parameters for the biased item, $a_1$ and
$a_2$, were specifically chosen to keep the MDISC value constant. As such, even though ten
different measurement angles were selected for the biased item, the amount of overall
discrimination (MDISC) was held constant at 1.5. The one biased item was always given a
$d = 0.0$ value. For the three-biased-items case, the biased items had the same parameters,
identical to the values used in the one-biased-item case.
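
Holding MDISC fixed while rotating the measurement direction determines both discrimination parameters, since $a_1 = \mathrm{MDISC}\cos(\alpha)$ and $a_2 = \mathrm{MDISC}\sin(\alpha)$ follow from the vector definitions. A minimal Python sketch, with hypothetical names, for the ten angles used in the design:

```python
import math

MDISC = 1.5  # overall discrimination held constant in the study

def biased_item_params(angle_deg, mdisc=MDISC):
    """Return (a1, a2) for an item vector of length mdisc at the given angle."""
    angle = math.radians(angle_deg)
    return mdisc * math.cos(angle), mdisc * math.sin(angle)

# Ten measurement angles: 0, 10, ..., 90 degrees from the positive theta axis.
params = {angle: biased_item_params(angle) for angle in range(0, 100, 10)}
for angle, (a1, a2) in params.items():
    print(angle, round(a1, 3), round(a2, 3))
```

At 0° the item is a valid item with $a_2 = 0$; at 90° essentially all of its discrimination lies on the nuisance dimension.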

There were three different levels for the number-of-biased-items factor: 0, 1, and 3. The
cases in which no biased items were present were used to examine the Type I error rate of the
MH and SIB statistics. Simulations which had one and three biased items were used to examine
the power of each statistic. Only the power of SIB was estimated for the cases involving three
biased items.

The number-of-subjects factor had six levels (reference n/focal n): 250/250, 500/250,
1000/250, 500/500, 1000/500, 1000/1000. These ratios were selected to cover the approximate
size of examinee populations that one might encounter in a national test administration, a
state-wide assessment, or a large urban school district setting.

Four levels of reliability were studied: .70, .80, .90, and .95. These four levels of
reliability were selected to represent a range of reliabilities one might encounter using
different types of tests, such as a personality measure or an achievement test. A FORTRAN program
was written to randomly select IRT item parameters which would provide the different levels of
reliability for the two specified groups. Discrimination parameters were randomly selected from
uniform distributions, each having a different range, and difficulty parameters from a N(0,1)
distribution. By varying the spread of the a-parameter values, four 40-item tests having the
specified KR-20 values were created by trial and error. Because the groups did not differ in
their $\theta$ abilities, at each reliability level the set of item parameters elicited the same
level of reliability for each group. Replacing one or three of the valid items with biased
item(s) that took advantage of the $\eta$-ability differences resulted in only slight
differences (< .03) in group reliabilities for the entire 40-item test.
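
For reference, KR-20 is computed from dichotomous item scores as $\frac{k}{k-1}\left(1 - \frac{\sum p_i q_i}{\sigma_X^2}\right)$, where $k$ is the number of items, $p_i$ is the proportion correct on item $i$, $q_i = 1 - p_i$, and $\sigma_X^2$ is the variance of the total scores. The Python sketch below is a generic implementation and is not the authors' FORTRAN program; the function name and the small response matrix are hypothetical.

```python
def kr20(responses):
    """KR-20 reliability for a matrix of 0/1 item scores (rows are examinees)."""
    n_examinees = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_examinees
    # Population variance (divide by n), matching the p * q item variances.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_examinees
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / n_examinees
        sum_pq += p * (1.0 - p)
    return n_items / (n_items - 1) * (1.0 - sum_pq / var_total)

# Hypothetical 4-examinee, 3-item response matrix with a perfect Guttman pattern.
responses = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]
reliability = kr20(responses)
print(reliability)
```

In a trial-and-error search of the kind described, a function like this would be applied to simulated response matrices while the spread of the discrimination parameters is adjusted until the target value is reached.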

The research design was completely crossed with 720 possible testing conditions. For each
testing condition, forty-item tests were randomly simulated for the reference and focal groups
and the corresponding MH and SIB statistics were computed. This was replicated 100 times for each
condition so that empirical Type I error rates and correct rejection rates could be calculated. A
statistical level of significance ($\alpha$) of .05 was used to test the null hypothesis of no
difference in item performance against a non-directional alternative hypothesis.
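
One way to see how the count of 720 arises is to cross the four factors as described: ten angles, three levels of number of biased items, six sample-size pairs, and four reliability levels. The Python sketch below enumerates that design; the variable names are hypothetical, and treating the angle factor as crossed with the no-biased-item level is an assumption made only to reproduce the stated total.

```python
from itertools import product

angles = list(range(0, 100, 10))            # 0, 10, ..., 90 degrees
n_biased_items = [0, 1, 3]
sample_sizes = [                            # (reference n, focal n)
    (250, 250), (500, 250), (1000, 250),
    (500, 500), (1000, 500), (1000, 1000),
]
reliabilities = [0.70, 0.80, 0.90, 0.95]

conditions = list(product(angles, n_biased_items, sample_sizes, reliabilities))
replications_per_condition = 100

print(len(conditions))                               # 720
print(len(conditions) * replications_per_condition)  # 72000
```

With 100 replications per condition, the fully crossed design implies 72,000 simulated test administrations for each group pairing.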

Results 

The means and variances of the M2PL $a_{1i}$ discrimination parameters ($a_{2i} = 0.0$) for
each level of reliability are shown in Table 1. These values are computed on the 39 valid items
used in the one-biased-item case. It is interesting to note that as reliability increased the
$a_{1i}$ discrimination parameters not only increased but became less variable. In a similar
fashion, with each incremental rise in reliability the location of the items (as represented by
$D_i$) shifted towards the $\theta$ mean of the underlying ability distribution and also became
less variable.



Insert Table 1 about here 



It should be noted that the M2PL model reduces to the unidimensional 2PL IRT model
when $a_{2i} = 0.0$. Consequently, to gain more insight into how the items were altered to
produce different levels of reliability, plots of the 2PL item characteristic curves (ICCs) were
constructed at each level of reliability. These four plots are displayed in Figure 4.



Insert Figure 4 about here 




As might be expected when using an internal-consistency measure of reliability, the ICCs become
more homogeneous as reliability increases. At the highest levels of reliability,
$\rho_{xx'} = .90$ and $\rho_{xx'} = .95$, the ICCs are essentially parallel (Rasch-like) and are
located over the center of the underlying ability distribution, indicated by the marginal density
curve at the bottom of each plot.

The two-dimensional perspective of the $\rho_{xx'} = .90$ test, showing the item vectors, is
illustrated in Figure 5. In this figure a biased item with a 50° orientation is also displayed.
Although less dramatic than the unidimensional ICCs in Figure 4, it helps to see the contrast
between a test composed of strictly unidimensional items and one biased item. The greater the
measurement angle with the positive $\theta$ axis, the less the item contributes to the
measurement of $\theta$ and the greater its potential for bias. At 90° the biased item is capable
of measuring only the nuisance skill. Despite an item's capacity to discriminate between levels
of the nuisance ability, bias can only be realized if the two groups of interest differ in their
$\eta$ ability.



Insert Figure 5 about here 

A final series of plots was created to examine the change in the expected observed score
distribution as reliability was increased. These graphs parallel the work by Lord (1953). Shown
in Figure 6, each graph displays the test characteristic curve, the (coincident) marginal
$\theta$ distribution of the focal and reference groups (below the $\theta$ axis), and the
distribution of the proportion-correct true score, $\zeta$ (to the left of the $\zeta$ axis).
These plots are interesting because they graphically demonstrate what happens to the expected raw
score distributions as the reliability is increased. Based on the apparent pattern, cases of very
low reliability or extremely high reliability result in expected raw score distributions which
have fewer observed score categories to contribute to the computation of the bias detection
statistics because of the clustering of subjects at particular score levels.



Insert Figure 6 about here 




The results of the simulation runs for the no-biased-item and one-biased-item cases are
displayed in Tables 2 and 3. The percent of correct rejections ($\alpha$ = .05) is displayed for
each level of reliability for the ten different degree measures of the biased item, for each of
the specified reference-focal group sample sizes.

Insert Tables 2-3 about here 



The Type I error rate for both statistics appears to be less than the nominal .05 level with
only a few exceptions. It does not appear to be influenced by the sample size ratio or the level
of reliability. Likewise, the difference between error rates for SIB and MH does not seem to
follow a consistent pattern, nor does there seem to be a significant difference in magnitude.
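
When reading these empirical rates it helps to keep their sampling error in mind. With 100 replications per condition, a rejection rate is a binomial proportion with standard error $\sqrt{p(1-p)/100}$. The Python sketch below, with hypothetical names, evaluates this at the nominal level.

```python
import math

def rate_standard_error(p, replications=100):
    """Binomial standard error of an empirical rejection rate."""
    return math.sqrt(p * (1.0 - p) / replications)

nominal = 0.05
se = rate_standard_error(nominal)
lower, upper = nominal - 2.0 * se, nominal + 2.0 * se
print(round(se, 4), round(lower, 4), round(upper, 4))
```

Under this rough two-standard-error band, observed Type I error rates between about .01 and .09 in a single condition are compatible with a true rate of .05, so small differences between conditions or between the two procedures should be interpreted cautiously.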

The pattern for power is clear. As was to be expected, when sample size was reduced
the power of both statistics decreased. One way to measure this reduction is to note how large
the angle of measurement of the suspect item had to be before the bias procedure was able to
determine the item was biased 100% of the time. As the sample size decreased, the angle of the
biased item at which 100% power was achieved increased for both statistics. For the MH
statistic in the 1000/1000 case with $\rho_{xx'} = .95$, a 100% rejection rate was obtained for
any biased item with an angle greater than or equal to 20°. In the case with the smallest sample
size (250/250) and $\rho_{xx'} = .95$, this rate of rejection was not achieved until the suspect
item had an angle of [illegible] or larger. In a similar fashion, SIB achieved a perfect
rejection rate for suspect items having an angle of 30° or more with sample sizes of 1000/1000
($\rho_{xx'}$ = .70, .90, and .95) and 1000/500 ($\rho_{xx'}$ = .70 and .80). For the 250/250
case SIB achieved a 100% rejection rate for angles greater than or equal to 60° for
$\rho_{xx'}$ = .70, .90, and .95.

For any given sample size and angular direction of the biased item, the power of each
statistic generally increased as the reliability increased, although not in a consistent manner.
In the 1000/250 case with the biased item at 20° there was a drop at higher levels of
reliability; the MH power rates were .66, .71, .68, and .66 for $\rho_{xx'}$ values of .70, .80,
.90, and .95, respectively. For these same conditions the SIB power rates were .70, .72, .69,
and .72. This drop in power occurs again for both statistics at the smallest simulated sample
size, 250/250.

Differences between the MH and SIB procedures in the rate of correct rejections always
seem to be quite small and not consistently in one direction. The largest differences occur in
the 250/250 case with $\rho_{xx'}$ = .90 and the biased item having angles of 20° (MH = .46,
SIB = .56) and 30° (MH = .86 and SIB = .79).

The results for the cases in which three items were biased are shown in Tables 4 and 5.
Only the power of SIB was evaluated in these cases. The Type I error rate was less than the
nominal .05 level with only a few exceptions, namely the 1000/250 case with $\rho_{xx'}$ = .90,
for which the Type I error rate was .10. Averaged over all conditions the Type I error rate was
about .04.



Insert Tables 4 and 5 about here 

For comparable sample sizes and levels of reliability the rejection rates of SIB in the
three-biased-item cases are much higher than in the one-biased-item cases. At each level of
reliability, SIB achieves a 100% rejection rate when the three items have an angle of 20° or
greater for sample sizes of 1000/1000, 1000/500, 500/500, and 500/250. For the smallest
sample size, this rate was achieved for angles greater than or equal to 40°. In all cases, as
reliability increased power increased.

Discussion 

The purpose of this study is to provide the practitioner with a set of guidelines concerning
the power of the MH and SIB tests for various levels of reliability and sample sizes. It is
intended to highlight conditions for which the MH and SIB statistics may yield inaccurate results.
Based upon the results it appears that both procedures are about equally powerful, with the
exception of angles of 10° and 20°, where SIB appears to be slightly more powerful. The Type I
error rates for each procedure seem to be below the nominal .05 level, and comparable at all
sample size levels.

It appears that for small sample sizes there may be a range of reliability in which power
is optimal. Ideally one would want to gather information at all levels of the raw score scale, but
as seen in Figure 6, this may not be possible with very low or very high levels of reliability.

In the multiple-biased-item case SIB seems to perform quite well. Because this is the only
known procedure which can examine multiple items at one time, this is encouraging news.
Practitioners should be encouraged to examine the effect of several biased items in concert, and
SIB seems to be a good procedure for doing this.

One also gains a sense of how aberrant, angle-wise, an item would have to be before it
would be consistently rejected as being biased. More work needs to be done to determine the
relationship between the angular difference and the substantive meaning of the item. In the real
world, tests are not strictly unidimensional, and thus an angular difference of 30° may not be
considered large enough to make the item construct invalid.
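
To make the notion of an item's angle concrete, the following sketch computes the direction and length of an item vector from its discrimination parameters. It assumes the compensatory M2PL form with exponent a1*theta1 + a2*theta2 + d and the multidimensional discrimination and difficulty definitions given by Reckase (1985). The function name is hypothetical, and the parameter values are taken as illustrative from the Figure 1 caption.

```python
import math


def item_vector(a1, a2, d):
    """Return (MDISC, angle in degrees from the positive theta-1 axis, MDIFF).

    Assumes an M2PL item with exponent a1*theta1 + a2*theta2 + d and
    at least one nonzero discrimination parameter.
    """
    mdisc = math.hypot(a1, a2)                    # length of the item vector
    angle = math.degrees(math.acos(a1 / mdisc))   # direction of best measurement
    mdiff = -d / mdisc                            # signed distance from the origin
    return mdisc, angle, mdiff


# Illustrative parameters: a1 = 1.8, a2 = .3, d = .5
mdisc, angle, mdiff = item_vector(1.8, 0.3, 0.5)
print(round(mdisc, 2), round(angle, 1), round(mdiff, 2))
```

For these values the item lies roughly 9.5° from the theta-1 axis. An item measuring only the second dimension (a1 = 0) would lie at 90°, the largest effect size simulated here.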

In conclusion, it is important to note that there are a number of factors that were not
investigated in this study. First, in a real testing situation the item vectors never all
lie in the same direction. Ideally they will lie in a narrow sector (cf.
Ackerman, 1992). How the width of this sector affects the power of each statistic is unknown.
Second, it can also be assumed that non-uniform bias does exist in many testing situations. How
this affects the correct rejection rates of the MH and SIB procedures at different levels of
reliability and for biased items having different measurement angles also remains unknown.
Third, one might choose to simulate increasing levels of reliability in a different manner,
specifically, by creating longer tests. Simulations done in this manner may not necessarily
imitate reality because of the large number of items needed to achieve high levels of reliability.
However, such a method would not have as severe an effect on the observed score distribution
as the manipulation of discrimination parameters, which was done in this study. Finally, this
study looked only at cases in which the test was equally reliable for both groups of interest.
Conceivably this may not always be the case.
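
The point about test length can be illustrated with the Spearman-Brown formula. The sketch below is only a rough guide: it assumes that added items are parallel to the existing ones, and it takes a 40-item test with reliability .70 as a hypothetical starting point. The function names are illustrative.

```python
import math


def lengthening_factor(rel_now, rel_target):
    """Spearman-Brown: factor by which a test must be lengthened."""
    return rel_target * (1.0 - rel_now) / (rel_now * (1.0 - rel_target))


def items_needed(n_items, rel_now, rel_target):
    """Number of parallel items needed to reach rel_target (rounded up)."""
    return math.ceil(n_items * lengthening_factor(rel_now, rel_target))


# Hypothetical baseline: 40 items with reliability .70
for target in (0.80, 0.90, 0.95):
    print(target, items_needed(40, 0.70, target))
```

Under these assumptions, moving from .70 to .95 would require lengthening the test by a factor of about 8.1, that is, well over 300 items, which is consistent with the concern that such simulations would not resemble real tests.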




Clearly bias research has just "scratched the surface" when it comes to understanding the
capability of various procedures to successfully determine when an item is biased. One word of
caution: in trying to detect biased items, one should never become too involved with the statistics
and forget about the actual item. Practitioners should never lose sight of all the factors that could
influence the examinees' response patterns (e.g., the wording of an item, its format, its position,
etc.).






Table 1

Means and standard deviations of M2PL discrimination
parameters for each level of reliability.

KR-20
Reliability     [four column headings illegible in source]

   .70           .68     .28     .45    1.71
   .80           .80     .31     .17    1.43
   .90          1.17     .14    -.19     .79
   .95          1.71     .10    -.15     .62

Note. n = 40 for each test; [symbol illegible in source] > 0.0 for all items.




Table 2

Empirical Type I error and power values for MH and SIB for sample sizes of 1000/1000,
1000/500 and 1000/250 and the case of 1 biased item.

                                        Level of Reliability
                                  .70         .80         .90         .95
Sample size    Angle(a) of
(Ref/Foc)      Biased Item      MH  SIB     MH  SIB     MH  SIB     MH  SIB

1000/1000      00(b)            04   03     03   04     06   05     01   05
               10               53   59     54   58     56   57     56   62
               20               98   98     99   99     99    *      *    *
               30 to 90         (c)

1000/500       00(b)            02   03     07   05     02   03     03   05
               10               37   40     38   44     37   33     41   44
               20               95   97     93   92     91   92     91   92
               30                ?    ?      ?    *     99   99     99   99
               40 to 90         (c)

1000/250       00(b)            03   06     04   06     03   05     06   05
               10               23   31     28   32     28   27     29   26
               20               66   70     71   72     68    ?     66   72
               30               90   94     93   96     97   99     96   97
               40                *   96      *    *      *    *      *    *
               50 to 90         (c)

Note. Decimals are deleted. * denotes a value of 1.0. ? marks a cell that is illegible in the source.

(a) Angles are expressed in degrees from the positive θ1-axis.

(b) Denotes the no bias case; corresponding row represents Type I error rate.

(c) Some cells in these rows are illegible in the source; every legible cell is *.






Table 3

Empirical Type I error and power values for MH and SIB for sample sizes of 500/500,
500/250 and 250/250 and the case of 1 biased item.

                                        Level of Reliability
                                  .70         .80         .90         .95
Sample size    Angle(a) of
(Ref/Foc)      Biased Item      MH  SIB     MH  SIB     MH  SIB     MH  SIB

500/500        00(b)            01   03     04   06     02   05     02   03
               10               20   25     28   30     24   25     29   39
               20               77   78     79   82     84   84     83   82
               30               98   99     98   97     99   99     99   98
               40 to 90         (c)

500/250        00(b)            01   03     00   01     02   03     01   02
               10               13   19     22   21     17   17     18   17
               20               59   67     62   66     59   64     62   63
               30               93   94     96   93     90   90     92   92
               40               99   99      *    *      ?   98     98   99
               50                *    *      *    *      *   99      ?    ?
               60 to 90         (c)

250/250        00(b)            04   04     03   06     03   01     01   03
               10               13   22     16   19     12   22     10   18
               20               49   49     54   56     46   56     50   49
               30               84   92     90   87     86   79     82   79
               40               99   96     98   98     98   94     98   98
               50                ?   99      *    ?      *   99     99   94
               60 to 90         (c)

Note. Decimals are deleted. * denotes a value of 1.0. ? marks a cell that is illegible in the source.

(a) Angles are expressed in degrees from the positive θ1-axis.

(b) Denotes the no bias case; corresponding row represents Type I error rate.

(c) Some cells in these rows are illegible in the source; every legible cell is *.






Table 4

Empirical Type I error and power values for SIB for sample sizes of
1000/1000, 1000/500 and 1000/250 and the case of 3 biased items.

                                   Level of Reliability
Sample size    Angle(a) of
(Ref/Foc)      Biased Item      .70     .80     .90     .95

1000/1000      00(b)             04      06      05      04
               10                89      93      96      95
               20                 *       *       *       *
               30 to 90         (c)

1000/500       00(b)             02      04      03      02
               10                79      84      83      89
               20 to 90         (c)

1000/250       00(b)             03      06      03      05
               10                51      58      59      60
               20                96      98      98      98
               30 to 90         (c)

Note. Decimals are deleted. * denotes a value of 1.0.

(a) Angles are expressed in degrees from the positive θ1-axis.

(b) Denotes the no bias case; corresponding row represents Type I error rate.

(c) Some cells in these rows are illegible in the source; every legible cell is *.




Table 5

Empirical Type I error and power values for SIB for sample sizes of
500/500, 500/250 and 250/250 and the case of 3 biased items.

                                   Level of Reliability
Sample size    Angle(a) of
(Ref/Foc)      Biased Item      .70     .80     .90     .95

500/500        00(b)             05      07      07      01
               10                59      66      67      74
               20                 *      98       *       ?
               30 to 90         (c)

500/250        00(b)             03      10      03      02
               10                47      50      48       ?
               20                94      96      95       ?
               30 to 90         (c)

250/250        00(b)             04      03      03      04
               10                36      37      44      36
               20                88      92      89      84
               30                99      99      98      99
               40 to 90         (c)

Note. Decimals are deleted. * denotes a value of 1.0. ? marks a cell that is illegible in the source.

(a) Angles are expressed in degrees from the positive θ1-axis.

(b) Denotes the no bias case; corresponding row represents Type I error rate.

(c) Some cells in these rows are illegible in the source; every legible cell is *.






References 



Ackerman, T. A. (1992). An explanation of differential item functioning from a multidimensional
      perspective. Journal of Educational Measurement.

Dorans, N. J. & Kulick, E. M. (1986). Demonstrating the utility of the standardization approach
      to assessing differential item performance on the Scholastic Aptitude Test. Journal of
      Educational Measurement, 23, 355-368.

Hambleton, R. K. & Swaminathan, H. (1985). Item response theory: Principles and applications.
      Boston, MA: Kluwer-Nijhoff Publishing.

Holland, P. W. & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel
      procedure. In H. Wainer and H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale,
      NJ: Lawrence Erlbaum Associates.

Hunter, J. E. (1975, December). A critical analysis of the use of item means and item-test
      correlations to determine the presence or absence of content bias in achievement test
      items. A paper presented at the National Institute of Education Conference on Test Bias,
      Annapolis, MD.

Kok, F. (1988). Item bias and multidimensionality. In R. Langeheine and J. Rost (Eds.),
      Latent trait and latent class models (pp. 263-274). New York, NY: Plenum Press.

Lord, F. M. (1953). The relation of test score to the trait underlying the test. Educational and
      Psychological Measurement, 547-549.

Mislevy, R. J. & Bock, R. D. (1983). BILOG: Item analysis and test scoring with binary logistic
      models [Computer program]. Mooresville, IN: Scientific Software.

Pine, S. M. (1977). Applications of item response theory to the problem of test bias. In D. J.
      Weiss (Ed.), Applications of computerized adaptive testing (Research Report 77-1).
      Minneapolis, MN: University of Minnesota, Psychometric Methods Program, Department
      of Psychology.

Reckase, M. D. (1985, April). The difficulty of test items that measure more than one ability.
      Paper presented at the annual meeting of the American Educational Research Association,
      Chicago, IL.

Reckase, M. D. (1986, April). The discriminating power of items that measure more than one
      dimension. Paper presented at the annual meeting of the American Educational Research
      Association, San Francisco, CA.

Shealy, R. & Stout, W. (in press). An item response theory model for test bias (ONR Technical
      Report). In H. Wainer & P. Holland (Eds.), Differential item functioning: Theory and
      practice. Hillsdale, NJ: L. Erlbaum Associates.

Shealy, R. & Stout, W. (1991). A procedure to detect test bias present simultaneously in
      several items (Technical Report 91-3-ONR). Champaign, IL: University of Illinois.

Traub, R. E. (1983). A priori considerations in choosing an item response model. In R. K.
      Hambleton (Ed.), Applications of item response theory (pp. 57-70). Vancouver, BC:
      Educational Research Institute of British Columbia.

Zwick, R. (1990). When do item response function and Mantel-Haenszel definitions of
      differential item functioning coincide? Journal of Educational Measurement, 3, 185-197.




Footnotes 

¹Both procedures can be used to detect either bias or DIF. It should be noted that bias
and DIF are distinct concepts; see Shealy and Stout (1991) for a careful discussion of the
difference. In this paper the term item bias will refer to situations where the user wishes to
detect either bias or DIF.



Figure Captions 

Figure 1. The item response surface and corresponding contour with the item vector
for the M2PL parameters a1 = 1.8, a2 = .3, d = .5.

Figure 2. Research design of the study detailing the four factors of interest which were fully
crossed.

Figure 3. A contour plot of the densities for the Reference and Focal groups with accompanying
marginal distributions.

Figure 4. Unidimensional ICCs for the 39 valid items for tests having KR-20 reliabilities of .70,
.80, .90, and .95.

Figure 5. A plot of the 40 item vectors for a KR-20 reliability of .90 with one biased item
having an effect size of 50°.

Figure 6. A plot showing the distributional relationships between the generating θ distribution
and the proportion-correct true score distribution.






Figure 2

Factor 1: Sample size - 6 levels
(Ref N/Foc N)

1000/1000
1000/500
1000/250
500/500
500/250
250/250

Factor 2: Direction of biased item - 10 levels
(amount of bias)

0° to 90° in 10° increments

Factor 3: Reliability - 4 levels

KR-20

.70
.80
.90
.95

Factor 4: Number of biased items - 3 levels

0 (To evaluate Type I error rates)
1 (To evaluate Power of MH & SIB)
3 (To evaluate Power of SIB)





Figure 6 






